At first glance, the concept of data compression seems too good to be
true. The idea of shrinking information without losing any of it looks
to be a something-for-nothing proposition that violates what should be
one of Newton’s lesser-known laws: the law of conservation of data.

Despite the aura of mystique that surrounds it, data compression is
based on a simple idea: mapping the representation of data from one
group of symbols to another, more concise series of symbols.
Data-compression programs and dedicated compression hardware use
several different algorithms to achieve this end.

Two compression schemes, Huffman coding and 
LZW coding (for Lempel and Ziv, its creators, and 
Welch, who made substantial modifications), form the 
basis for much of the compression that we use from day 
to day. These techniques also represent two distinct 
schools of compression algorithms. An understanding 
of how each algorithm works provides an excellent 
background in compression in general. 

Both Huffman and LZW coding are lossless compression techniques. They
are appropriate to use for compressing any kind of data because the
expanded representation is identical to the original input to the
compressor. Joint Photographic Experts Group (JPEG), Moving Picture
Experts Group (MPEG) (see “Putting the Squeeze on Graphics,” December
1990 BYTE), and other cutting-edge image-compression algorithms achieve
fantastic compression ratios at the expense of exact data reproduction.
These techniques work well for images and sound data, but they are not
appropriate for general data.

Huffman coding, first published in 1952, reduces the number of bits
used to represent frequent characters and increases the number of bits
used for infrequent characters. The LZW method, on the other hand,
encodes strings of characters, using the input stream to build an
expanded alphabet based on the strings that it sees. These two very
different approaches both work by reducing redundant information in the
input data.




Huffman Coding 

Huffman coding is probably the best-known method of data compression.
The simplicity and elegance of the technique have made it a longtime
academic favorite. But Huffman codes also have practical applications;
for example, static Huffman codes are used as the last stage of JPEG
compression. The MNP-5 data-compression standard for modems (see “4800
Bits, No Errors,” June 1989 BYTE) uses dynamic Huffman compression as
part of its process. Finally, Shannon-Fano coding, a close relative of
Huffman coding, is used as one stage in PKZIP’s powerful “imploding”
algorithm.
the symbol’s leaf. Suppose the compressor wants to encode the letter s.
It starts at the leaf corresponding to s and jumps to the parent node,
noting which branch (0 or 1) it was on. It continues to jump up the
tree until it reaches the root. The list of branches, when reversed,
describes the path from the root to s; this is the symbol’s Huffman
code.

High-probability characters are close to the root, so their codes are
short. Low-probability characters are far from the root and have longer
codes.

To decode, the decompressor takes 
the code and processes it in reverse. That 
is, it starts at the root of the tree. If the 
first bit in the code is a 1, it jumps to the 
node on the 1-branch from the root. It 
continues reading bits and jumping until 
it reaches a leaf; the symbol at the leaf is 
the decoded character. 

One more property of the Huffman tree bears discussion. Because symbols
are always leaves, symbol nodes never have any children. When the
decompressor gets to a leaf node, it knows to stop reading from the
input immediately. In other words, one Huffman code is never the prefix
of another. This means that although code lengths are variable, the
decompressor always knows when one code ends and another begins, and
there is no need to explicitly place delimiters between codes.
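
The leaf-to-root encoding and root-to-leaf decoding described above can
be sketched in a few lines of Python. This is only an illustration: the
Node class, the function names, and the frequency table in the usage
note are made up for the example, and the tree is built with the
standard merge-the-two-lightest-nodes procedure.

```python
import heapq
from itertools import count

class Node:
    def __init__(self, weight, symbol=None, children=()):
        self.weight = weight
        self.symbol = symbol
        self.children = children      # () for a leaf, (0-child, 1-child) otherwise
        self.parent = None
        self.branch = None            # branch of the parent this node hangs on
        for branch, child in enumerate(children):
            child.parent, child.branch = self, branch

def build_tree(frequencies):
    """Build a Huffman tree; needs at least two symbols."""
    tick = count()                    # tie-breaker so the heap never compares Nodes
    leaves = {sym: Node(w, sym) for sym, w in frequencies.items()}
    heap = [(leaf.weight, next(tick), leaf) for leaf in leaves.values()]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        parent = Node(a.weight + b.weight, children=(a, b))
        heapq.heappush(heap, (parent.weight, next(tick), parent))
    return heap[0][2], leaves

def encode(symbol, leaves):
    """Walk from the leaf up to the root, then reverse the branch list."""
    bits, node = [], leaves[symbol]
    while node.parent is not None:
        bits.append(str(node.branch))
        node = node.parent
    return "".join(reversed(bits))

def decode(bits, root):
    """Walk down from the root; a leaf ends the code, so no delimiters are needed."""
    out, node = [], root
    for bit in bits:
        node = node.children[int(bit)]
        if not node.children:
            out.append(node.symbol)
            node = root
    return "".join(out)
```

With an invented table such as {"e": 12, "t": 9, "a": 8, "s": 6, "z": 1},
encoding "seats" symbol by symbol and passing the concatenated bits to
decode returns the original string, and the code for e comes out no
longer than the code for z.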

Dynamic Huffman Coding 

The greatest difficulty with Huffman codes, as you probably noticed
from the discussion above, is that they require a table of
probabilities for each type of data to compress. This is not a problem
if you know you will always compress English text; you simply provide a
suitable English text tree to the compressor and decompressor. The JPEG
protocol defines a default Huffman tree for compressing JPEG data. In
the general case, when you don’t know the symbol probabilities for your
input data, static Huffman codes can’t be used effectively.

Fortunately, a dynamic version of Huffman compression can construct the
Huffman tree on the fly while reading and actively compressing. The
tree is constantly updated to reflect the changing probabilities of the
input data.

Listing 1 contains a pseudocode version of a dynamic Huffman
compression/decompression program. The actual code, which is available
from the usual sources, is written in 8088 assembly language. These
programs are based on an algorithm described in reference 1, which
cites a number of original sources. Reference 2 presents a more
efficient, although complex, algorithm for dynamic Huffman compression.

The key to starting with an uninitialized tree is the introduction of
an empty leaf. The empty leaf is simply a leaf node with no symbol
attached to it; this leaf has zero probability. The initial tree, held
by both the compressor and decompressor, has only the root and a single
empty leaf.

The compressor starts the ball rolling by reading in a character. It
attaches this character to the 1-branch of the root, leaving the empty
leaf on branch 0. It then sends this character to the decompressor as a
literal ASCII code, and the decompressor makes the same adjustment to
its tree.

For each character read thereafter, the compressor performs the
following steps. First, it checks to see if the code is in the encoding
tree. If the code is there, the compressor sends it in the same fashion
as in the static case. If not, it sends the code for the empty leaf.
Then it sends the new character as a literal ASCII code. Finally, the
compressor adds two codes, one for a new empty leaf on branch 0 and one
for the new code on branch 1. When the tree is full (i.e., when all
characters have been seen), the compressor just changes the last empty
leaf node into the last character.

The decompression program can make adjustments to its tree because it
has exactly the same tree as the compressor. When it receives an empty
leaf code, it reads the next code from the compressed data as an ASCII
literal. It then employs the same update routine as the compressor uses
to update the tree.

The empty leaf and the uninitialized tree don’t solve the problem of
keeping track of changing probabilities, however. To do that, you need
to introduce weights to each node in the tree and update these weights
as you process the input data. You also need to maintain a list of node
designations (and weights) sorted by weight.

Each character starts at weight 1 (the empty leaf starts at 0).
Whenever the compressor transmits a character that is in the table, it
increments the weight of that character’s node. If this change makes
the character node heavier than nodes that are listed higher in the
weight list, the compressor swaps the character node with the heaviest
node that is lighter than the character node. By swapping, I mean
trading parent nodes and branch designations only; the children of the
swapped nodes are not affected, so there is no danger of a leaf node
becoming internal, or an internal node becoming a leaf.

The compressor then jumps up the tree to the character’s parent, which
may have changed with the last swap. It continues the process with the
parent and on up the tree until it gets to the root.
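
The following Python sketch illustrates the empty-leaf idea and the way
both sides stay in step, but it deliberately swaps in a simpler update:
instead of the incremental weight-and-swap procedure described above,
it rebuilds the whole Huffman tree from running counts before every
symbol. That is far slower and will not produce the same bit patterns
as listing 1. The names ESC, build_tree, codes_of, compress, and
decompress are invented for the example, characters are assumed to fit
in 8 bits, and the decompressor is assumed to be told how many
characters to expect.

```python
import heapq
from itertools import count

ESC = ""   # stands in for the empty leaf; never equal to a real one-character symbol

def build_tree(weights):
    """Return a tree: a str is a leaf, a (zero, one) tuple is an internal node."""
    tick = count()                         # tie-breaker; both sides must tie-break alike
    heap = [(w, next(tick), sym) for sym, w in sorted(weights.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, t0 = heapq.heappop(heap)
        w1, _, t1 = heapq.heappop(heap)
        heapq.heappush(heap, (w0 + w1, next(tick), (t0, t1)))
    return heap[0][2]

def codes_of(tree, prefix=""):
    if isinstance(tree, str):
        return {tree: prefix}
    table = codes_of(tree[0], prefix + "0")
    table.update(codes_of(tree[1], prefix + "1"))
    return table

def compress(text):
    weights, bits = {ESC: 0}, []           # the empty leaf has weight 0
    for ch in text:
        codes = codes_of(build_tree(weights))
        if ch in weights:
            bits.append(codes[ch])         # known character: send its code
        else:                              # new character: empty-leaf code + literal
            bits.append(codes[ESC] + format(ord(ch), "08b"))
        weights[ch] = weights.get(ch, 0) + 1
    return "".join(bits)

def decompress(bits, length):
    weights, out, pos = {ESC: 0}, [], 0
    while len(out) < length:
        node = build_tree(weights)         # same tree the compressor had at this point
        while not isinstance(node, str):   # walk down until a leaf is reached
            node = node[int(bits[pos])]
            pos += 1
        if node == ESC:                    # empty leaf: the next 8 bits are a literal
            node = chr(int(bits[pos:pos + 8], 2))
            pos += 8
        out.append(node)
        weights[node] = weights.get(node, 0) + 1
    return "".join(out)
```

As in the article’s scheme, the very first character costs exactly 8
bits, because the initial tree holds nothing but the empty leaf and its
code is zero bits long. Compressing "ss" yields 9 bits: an 8-bit
literal followed by a 1-bit code.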

The figure shows the early stages of dynamic Huffman tree construction
for a very simple input. You can follow both the addition of new leaves
via the empty leaf mechanism and the node swapping in this diagram.

Huffman Gotchas 

As usual, there are a few snags when you’re actually implementing the
dynamic algorithm, regardless of its elegance. The first problem is
that you can’t perform node swapping while transmitting a code,
although both require you to start at a character node and hop up the
tree parent by parent. You can’t do the two procedures at the same
time, because swapping nodes causes the parent to change, which causes
the code transmitted to change. You would send a code to the
decompressor before it knows what to do with it.

A way around this dilemma is to make two passes in the compressor: one
for transmitting and one for updating. The decompressor also makes two
passes: one for receiving (going down the tree) and one for updating
(going back up).

The second problem occurs because of the empty leaf. Because the empty
leaf has zero weight, it is possible for a sibling of the empty leaf to
become heavier than its parent at the start of the update process.
However, swapping between child and parent will scramble the tree,
leaving the parent as its own child. Fortunately, simply aborting any
swap between child and parent solves the problem.

Finally, there isn’t any way for the decompressor to detect the end of
transmission if the compressor must send out full bytes (as in a
file-compression program). Suppose, for example, a transmission is 81
bits long. When the decompressor reads the first bit of the eleventh
byte, it has no way of knowing that it’s the last significant bit and
that the remaining 7 are garbage. Therefore, the file-compression code
must prepend a file length to the compressed data, making it a few
bytes longer.

Figure: The Huffman encoding tree changes to respond to changing
character probabilities in this dynamic example. At first, all the
transmissions are empty leaf code/literal character combinations. When
i and s are transmitted a second time, the compressor uses their codes
instead of literals. As s is reused, it moves higher up the tree,
shortening its corresponding code.

Table 1: An instance of compression, at code 258. The compressor saves
a code by transmitting 258 instead of is, the literal representation.
Strings are stored in the LZW table as code-character combinations
rather than full strings.

Input     Compression table     Compressed string   Expansion table
T
h         256 <- T+h            T
i         257 <- h+i            h                   256 <- T+h
s         258 <- i+s            i                   257 <- h+i
(space)   259 <- s+(space)      s                   258 <- i+s
i         260 <- (space)+i      (space)             259 <- s+(space)
s
(space)   261 <- 258+(space)    258                 260 <- (space)+i
a         262 <- (space)+a      (space)             261 <- 258+(space)
(end)                           a                   262 <- (space)+a
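
The end-of-transmission problem raised under Huffman Gotchas can be
sketched in Python as follows. The article’s compressor prepends a file
length; this sketch assumes, purely for illustration, a 4-byte
big-endian count of significant bits, and the names pack_bits and
unpack_bits are invented.

```python
import struct

def pack_bits(bits):
    """Pack a string of '0'/'1' into whole bytes, preceded by a 4-byte bit count."""
    padded = bits + "0" * (-len(bits) % 8)          # pad the final byte with zeros
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return struct.pack(">I", len(bits)) + body

def unpack_bits(data):
    """Recover exactly the significant bits, discarding the padding."""
    (nbits,) = struct.unpack(">I", data[:4])
    bits = "".join(format(byte, "08b") for byte in data[4:])
    return bits[:nbits]
```

An 81-bit transmission packs into 11 data bytes plus the 4-byte header,
and unpack_bits returns the original 81 bits without the 7 bits of
padding.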

LZW Compression 

The LZW algorithm, which was first presented by Welch in 1984 (see
reference 3), has become a widely used technique during the last few
years. CompuServe’s GIF file format uses LZW compression, as do ARC,
Unix’s compress, StuffIt, and PKZIP. The algorithm itself is patented
by Sperry.

Although straightforward in concept, the LZW algorithm can be a little
difficult to implement on a real machine with real constraints. Despite
some complexities, however, the technique is powerful and fast enough
to make it popular.

LZW works by extending the alphabet: it uses the additional characters
to represent strings of regular characters. To use LZW compression on
8-bit ASCII codes, you extend the alphabet by using 9-bit or larger
codes. The additional 256 characters that the 9-bit code gives you are
used to store strings of 8-bit codes, which are determined from strings
in the input.

The compressor maintains a string table with strings and their
corresponding codes. The string table corresponds to the extended
alphabet. Initially, the compressor starts with a string table with
only the 256 literal codes defined. If you’re using 9-bit codes, the
string table has an additional 256 empty entries; if you’re using
10-bit codes, it has 768 empty entries, and so on.

The compression algorithm works like this: Start with a null string.
Read in a character, and append it to the string. If the string is in
the string table, continue reading and appending characters until you
find a string that is not. Add this string to the string table. Write
the code for the last known string that matched to the output. Use the
last character as the basis for the next string, and continue reading
until you run out of input. That’s really all there is to it.
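
The steps above can be sketched directly in Python. This version is a
minimal illustration, not the article’s listing: it stores whole
strings in a dictionary, lets the table grow without limit, and returns
a list of integer codes rather than packed bits. The name lzw_compress
is invented for the example.

```python
def lzw_compress(data):
    """Return the list of LZW codes for a bytes object."""
    table = {bytes([i]): i for i in range(256)}   # the 256 literal codes
    next_code = 256
    current = b""                                 # start with a null string
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate                   # keep extending the match
        else:
            codes.append(table[current])          # write the last known string
            table[candidate] = next_code          # add the new string
            next_code += 1
            current = bytes([byte])               # last character starts the next string
    if current:
        codes.append(table[current])              # flush what is left at end of input
    return codes
```

For the input b"This is a", the result is eight codes: the literals for
T, h, i, s, and space, then 258 for the repeated string is, then the
literals for space and a.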

Table 1 shows an example of LZW compression, using the same simple
input as in the figure. The compressor reads in the initial T and
appends it to the null string. The string T is a literal character, so
it is in the table. Next, the compressor reads an h and looks up Th in
the string table, where it doesn’t find it. It adds Th to the table at
the next available position and sends out the last known string, T. It
continues reading characters and adding strings until the input is
exhausted.

This short and simple sample input shows only one instance of
compression, when the code 258 is sent out instead of the string is. If
I were using 9-bit codes, I would have sent eight 9-bit codes (72 bits)
to represent the nine 8-bit characters of This is a: 9 bytes either
way, for break-even performance. Longer, more realistic inputs, of
course, let you build a longer and more effective string table. The
more often strings repeat, the more you can compress.

Unfortunately, this simple compression algorithm eats memory like
popcorn. Every time the compressor finds a new string, it adds it to
the table. Each string that it adds is of variable length, which can
lead to a storage nightmare.

Luckily, there is a simple way out. As you may have noticed, each new
string is actually an old string plus a new character. Instead of
storing strings explicitly, you can store them as code and appended
character combinations. Table 1 shows this storage method. Code 261,
for example, is stored as 258+(space) rather than “is(space)”, which is
the string that it represents.
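
A sketch of the same compressor using code/character pairs instead of
full strings might look like this in Python. Every table entry is now a
fixed-size pair, so the variable-length storage problem disappears. The
function name and the use of a Python dictionary as the table are
assumptions made for the illustration.

```python
def lzw_compress_pairs(data):
    """Return (codes, table), where table maps (base code, byte) -> new code."""
    table = {}
    next_code = 256
    codes = []
    current = None                      # code of the string matched so far
    for byte in data:
        if current is None:
            current = byte              # a single byte is its own literal code
        elif (current, byte) in table:
            current = table[(current, byte)]
        else:
            codes.append(current)
            table[(current, byte)] = next_code
            next_code += 1
            current = byte
    if current is not None:
        codes.append(current)
    return codes, table
```

Run on b"This is a", it produces the same eight codes as the
full-string version, and the table holds (258, 32) -> 261, matching the
258+(space) entry in Table 1.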



Expanding LZW

Like the dynamic Huffman algorithm 
described earlier, LZW coding does not 
require you to pass a decoding table to 
the expander along with the compressed 
data. The LZW expander can build its 
own table from nothing but the codes in 
the compressed data. 

The expansion program starts with a table, just like the compressor’s,
with only literal data defined. It begins by reading the first code
from the compressed input, which is always a literal character. It
sends this character to the output, but otherwise it just holds onto
the character to form the basis for the next string.

For each code after the first that the expander reads, it generates a
string and makes an update to the string table. The expander first uses
the string table to translate the code value to an output string. For
nonliteral codes, it backtracks through the code/character combinations
of the string table, pushing characters onto a stack as it goes. When
the expander reaches a literal code, it pops the stack to produce the
output string.
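
The backtracking step can be sketched as a small Python function. It
assumes the table is a dictionary mapping each nonliteral code to its
(base code, appended byte) pair; the name expand_code is invented for
the example.

```python
def expand_code(code, table):
    """Translate one code into its string by backtracking through the table."""
    stack = []
    while code >= 256:              # nonliteral: peel off the appended character
        code, byte = table[code]
        stack.append(byte)
    stack.append(code)              # reached a literal code
    return bytes(reversed(stack))   # popping the stack yields the string in order
```

With the entries 258 -> (105, 115) and 261 -> (258, 32) from Table 1,
expand_code(261, table) returns b"is ".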

In addition, each code after the first one causes a table update. For
the second code, the expander adds a code made up of the first code,
plus the first character in the string described by the second code.
For each code thereafter, the expander adds the last code translated
plus the first character in the current string to the table. The
resulting table is an exact duplicate of the compression table, which
changes with each code received.

Welch describes a special-case situation that complicates the expansion
algorithm slightly. A certain type of string can cause the compressor
to output a code before the expander has it in its table. This
situation occurs when strings of the form XandXandX appear and the
string Xand is already in the table. In this case, the compressor will
send the code for Xand (because it already knows that string) and then
add XandX to the table. It will then start with the middle X, find the
next group of characters that it knows is XandX, and send the code for
XandX before the expander knows its meaning.

You can handle this special case by adding a few lines of code in the
expander program. If the expander receives a code that it doesn’t
recognize, it knows that it has encountered this singular case. In the
above example, the expander receives the code for Xand and then an
unknown code. It writes out the last translated code again (Xand) and
then the first character from that code (X). It then adds a combination
of these characters (XandX) to the table, which puts it back in sync
with the compressor.
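
A compact Python sketch of the expander, including the special case,
follows. For brevity it stores whole strings in the table rather than
code/character pairs, so it does no stack backtracking; that choice and
the name lzw_expand are assumptions of the example, not the article’s
listing.

```python
def lzw_expand(codes):
    """Rebuild the original bytes from a list of LZW codes."""
    table = {i: bytes([i]) for i in range(256)}   # only literal data defined
    next_code = 256
    codes = iter(codes)
    try:
        previous = table[next(codes)]             # the first code is always a literal
    except StopIteration:
        return b""
    out = [previous]
    for code in codes:
        if code in table:
            entry = table[code]
        elif code == next_code:                   # special case: code not yet in the table
            entry = previous + previous[:1]       # last string plus its own first character
        else:
            raise ValueError("corrupt LZW stream")
        out.append(entry)
        table[next_code] = previous + entry[:1]   # last string + first char of current one
        next_code += 1
        previous = entry
    return b"".join(out)
```

The codes [97, 256] exercise the special case: code 256 arrives before
the expander has defined it, and the output is b"aaa".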

Enhanced LZW 

Two enhancements to the basic LZW algorithm, variable-length codes and
table clearing, make for a more flexible and robust compressor.

With fixed-length output codes, you 
must decide up front how many bits to 
use for encoding the compressed data. If 
you use a small number of bits, the table 
fills quickly and compression drops off 
rapidly. If you use a large number of bits, 
the overhead for each code that you do 
not successfully compress is enormous. 

The sample code (see listing 2) uses 
variable-length codes to work around this 
problem. Initially, it uses 9-bit codes. 
When the compressor runs out of 9-bit 
codes, it switches to 10 bits, and on up 
through 13. It then uses 13-bit codes for 
the rest of the output. 

Listing 2 shows that the compressor increases the bit length when the
next code to add to the table requires more bits than allowed by the
current bit length. This is not the next code to output; the next code
to output will be the code that matches the next part of the input.
However, because the expander and compressor use this same method for
determining which code in the table to use next, they make the switch
in bit sizes simultaneously.
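
The width rule can be sketched in Python as below. The exact switch
point used in listing 2 is not reproduced here; this sketch assumes the
width is simply the number of bits needed for the next code to be added
to the table, clamped to the 9-to-13-bit range, and that the compressor
adds one table entry per code written. The names code_width and pack
are invented for the example.

```python
MIN_BITS = 9     # starting code width
MAX_BITS = 13    # widest code used

def code_width(next_free_code):
    """Bits per output code, given the next code that will be added to the table."""
    return min(max(next_free_code.bit_length(), MIN_BITS), MAX_BITS)

def pack(codes, first_free_code=256):
    """Write codes as a '0'/'1' string, widening as the table grows."""
    bits, next_free = [], first_free_code
    for code in codes:
        bits.append(format(code, "0{}b".format(code_width(next_free))))
        next_free += 1               # one new table entry per code written (assumed)
    return "".join(bits)
```

Under these assumptions the eight codes for This is a still occupy 72
bits, and the first 10-bit code appears once code 512 is the next to be
added. An expander would have to count table entries the same way to
switch widths in step.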

Even with a 13-bit table, the LZW 
compressor will eventually run out of 
string locations. One way to handle this 
problem is to stop adding entries and use 
the strings in the table to compress the 
rest of the input. This will result in poor 
compression if the type of data changes 
from one part of the input to another. 

You could also clear the table when it 
becomes full and start building the table 
again with the new data. Although this 
method makes the compressor more 
flexible than the do-nothing approach, it 
will also result in reduced compression 
while the table is mostly empty. 

Listing 2 uses a partial-clearing approach to freshen the string table
when it becomes full. The code clears only some of the older strings in
the table when it becomes necessary.

Because the string data is stored as base code/character combinations,
you can’t merely keep track of the least frequently or least recently
used strings in the table and later eliminate them when the table is
full. You can eliminate only the nodes that are not used by other codes
as base codes (the leaves).

It makes sense to keep track of the age of each leaf by the number of
bits required to describe its code. When the table fills, you can
remove all the 9-bit leaves and reuse their codes. When the table fills
again, you can recycle the 10-bit leaf codes. Once the 13-bit leaves
have been reused, you can go back to removing 9-bit leaves and continue
in this manner indefinitely.

To determine which node is a leaf and 
which is not, the table-clearing routine 
takes a relatively brute force approach. 
First, it marks all the nodes as leaves. It 
then goes through the table, looking at 
base codes. Each base code is the code of 
a node that is not a leaf, so it unmarks the 
node that corresponds to that code. 
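
The mark-then-unmark pass, together with the age-by-bit-length
selection, can be sketched in Python like this. The table is assumed to
be a dictionary mapping each string code to its (base code, appended
byte) pair, and both function names are invented for the example.

```python
def find_leaves(table):
    """Return the codes that no other entry uses as a base code."""
    is_leaf = dict.fromkeys(table, True)       # first, mark every node as a leaf
    for base_code, _ in table.values():        # each base code is not a leaf
        if base_code in is_leaf:
            is_leaf[base_code] = False
    return {code for code, leaf in is_leaf.items() if leaf}

def leaves_to_recycle(table, bits):
    """Leaves whose code needs exactly `bits` bits (256..511 when bits is 9)."""
    return {code for code in find_leaves(table) if code.bit_length() == bits}
```

Given entries for codes 256, 258, 261 (based on 258), and 600 (based on
261), the leaves are 256 and 600; only 256 is a 9-bit leaf, and only
600 is a 10-bit leaf.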

Unfortunately, there is one more complication in finding the leaves to
eliminate. The compressor stores its string table in a hashed array
because it must try to find codes in the table knowing only their base
codes and appended characters. For the clearing routine to find codes
given the code itself, you can use a cross-index table that maps sorted
codes to table locations. While this uses up a good chunk of memory
(16K bytes for 16-bit pointers and a 13-bit table), it provides for
quick table access by either the code or contents.
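
In a high-level language the two access paths can be sketched with a
pair of dictionaries, as in the Python class below. This stands in for
the hashed array and cross-index table only conceptually; the class and
method names are invented, and nothing here reflects the memory layout
of the assembly code. The 16K-byte figure follows from 8,192 table
entries (13 bits) times 2 bytes per 16-bit pointer.

```python
class StringTable:
    """String table reachable by contents (for compressing) or by code (for clearing)."""

    def __init__(self):
        self.by_contents = {}    # (base code, appended byte) -> code
        self.by_code = {}        # code -> (base code, appended byte)

    def add(self, code, base_code, byte):
        self.by_contents[(base_code, byte)] = code
        self.by_code[code] = (base_code, byte)

    def lookup(self, base_code, byte):
        """Return the code for this combination, or None if it is not in the table."""
        return self.by_contents.get((base_code, byte))

    def remove(self, code):
        """Drop an entry given only its code, keeping both indexes consistent."""
        del self.by_contents[self.by_code.pop(code)]
```

The compressor would call lookup with a base code and character, while
the clearing routine would call remove with the code of a leaf it wants
to recycle.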

The Sample Code 

To try out these two compression algorithms, I wrote two assembly
routines designed to be called from programs written in C. (The full
text of these routines is available in electronic format. See page 5
for details.) Both take an input and output file handle, compressing
data from the input file and writing it to the output file.

Even if you don’t need to write your own compressor, a little
background in data compression is useful. Although data compression may
appear complex and fraught with danger, it’s actually valuable and
reliable, as you can see.